Search Results: "Erich Schubert"

29 August 2011

Erich Schubert: ELKI 0.4 beta release

Two weeks ago, I published the first beta of the upcoming ELKI 0.4 release. The accompanying publication at SSTD 2011 won the "Best Demonstration Paper Award"!
ELKI is a Java framework for developing data mining algorithms and index structures. It has indexes such as the R*-Tree and M-Tree, and a huge collection of algorithms and distance functions. These are all written rather generically, so you can build all combinations of indexes, algorithms, and distances. There are also evaluation and visualization modules.
Note that I'm using "data mining" in the broad, original sense that focuses on knowledge discovery by unsupervised methods such as clustering and outlier detection. Today, many people just think of machine learning and "artificial intelligence" - or even worse: large-scale data collection - when they hear data mining. But there is much more to it than just learning!
Java comes at a certain price. The latest version is already around 50% faster than the previous release, just by reducing Java boxing and unboxing, which puts quite some pressure on the memory management. You could implement these things in C and be a lot faster; but this is not production software. I need code that students can work with and extend, and that is much more important than getting the maximum speed. You can still use this for prototyping: see what works, then reimplement just the parts you really need in a low-level language for maximum performance.
You can do some of that in Java. You could work on a large chunk of doubles and access them via the Unsafe class - but is that then still Java, or are you actually just writing plain C? In our framework, we want to be able to support non-numerical vectors and non-double distances, too, even when they are only applicable to certain specialized use cases. Plus, generic and Java-style code is usually much more readable, and the performance cost is not critical for research use.
Release 0.4 has plenty of under-the-hood changes. It allows multiple indexes to exist in parallel, and it supports multi-relational data. There are also a dozen new algorithms, mostly from the geo/spatial outlier field, which were used for the demonstration. It also includes, for example, methods for rescaling the output of outlier detection methods to a more sensible numerical scale for visualization and comparison.
You can install ELKI on a Debian testing or unstable system with the usual "aptitude install elki" command. It installs a menu entry for the UI and also includes the command-line launcher "elki-cli" for batch operation. The "-h" flag produces extensive online help, or you can just copy the parameters from the GUI. By reusing Java packages such as Batik and FOP that are already in Debian, the package is also a smaller download. I guess it will at some point also transition to Ubuntu - and since it is Java, you can just download and install it there anyway.

22 August 2011

Erich Schubert: Missing in Firefox: anon and regular mode in parallel

The killer feature that Chrome has and Firefox is missing is quite simple: the ability to have "private" aka "anonymous" and non-private tabs open at the same time. As far as I can tell, with Firefox you can only be in one of these modes at a time.
My vision of a next-generation privacy browser would essentially allow you to have tabs in individual modes, with some simple rules for mode switching. Essentially, unknown sites should always be opened in anonymous mode. Only sites that I register for should automatically switch to a tracked mode where cookies are kept, for my own convenience. And then there are sites in a safety category, such as my banking site, that should be isolated from any other content. Going to these sites should require me to manually switch modes (except when using my bookmark). Embedding, framing, and similar inclusion of these sites should be impossible.
On a side note, having TOR tabs would also be nice.

16 August 2011

Erich Schubert: Documenting fat-jar licenses

Dear Lazyweb,
What is the appropriate way to document the individual licenses of a fat jar (a jar archive that includes the program along with all its dependencies)?
I've been living in the happy world of Linux distributions, where one would just package the dependencies independently (or actually just specify which existing packages you use, since most of it is already packaged by helpful other developers). But for the non-Linux people, a fat jar seems to be important so they can double-click it to run the application.
When building larger Java applications, you end up using a fair amount of external code, such as various Apache Commons libraries and Apache Batik. I'm currently including all their LICENSE.txt files in a directory named "legal", and I'm trying to make it as obvious as possible which license applies to which parts of the jar archive. Is there any best practice for doing this? I don't want to reinvent the wheel; I'd also like to avoid any common legal pitfalls, obviously.
Feel free to respond either using the Disqus comment function or by email via erich () debian org

26 July 2011

Erich Schubert: Restricting Skype via iptables

Whenever I launch Skype on my computer, it gets banned from the university network within a few minutes; the ban expires again a few minutes after I close Skype. This is likely due to the aggressive nature of Skype - maybe the firewalls think it is trying to do a DDoS attack. It is one of the known big issues of using Skype.
For Windows users, there are some known workarounds to limit Skype that usually involve registry editing. Unfortunately, these are not available on Linux.
Therefore, I decided to play around with advanced iptables functionality. You cannot match the originating process reliably (the owner match module seemed to include such functionality at some point, but it was deemed unreliable on multi-core systems). However, there are other, more efficient methods of achieving the same effect.
Here's my setup:
# Add a system group for Skype
addgroup --system skype
# Override permissions of the skype binary (assuming the Debian package!)
# so it is setgid and the process runs with the "skype" group:
dpkg-statoverride --update --add root skype 2755 "$(which skype)"
And these are the iptables rules I use:
iptables -I OUTPUT -p tcp -m owner --gid-owner skype \
    -m multiport ! --dports 80,443 -j REJECT
iptables -I OUTPUT -p udp -m owner --gid-owner skype -j REJECT
They allow outgoing connections by Skype only on ports 80 and 443, which supposedly do not trigger the firewall (in fact, this filter is recommended by our network administration for Skype).
Or, wrapped as a module for pyroman (my firewall configuration tool; aptitude install pyroman):
"""
Skype restriction to avoid firewall block.
Raw iptables commands.
"""
iptables(Firewall.output, "-p tcp -m owner --gid-owner skype -m multiport ! --dports 80,443 -j %s" % Firewall.reject)
iptables(Firewall.output, "-p udp -m owner --gid-owner skype -j %s" % Firewall.reject)
I've put it just after the default conntrack module, as 05_skype.py.

3 July 2011

Erich Schubert: Google vs. Facebook

Let the games begin. It looks like Google and Facebook are going to fight it out.
Given the spectacular failures of Wave and Buzz, people are of course skeptical about the success of Google Plus. However, I'm rather confident it is going to stay. Here are some things I like about it:
Some people think that Google will not stand a chance against the social giant Facebook. But after all, Google has more users - and they have tons of services people like to use. So when Google Maps has Plus integration, will users use it to chat about where to meet, or will they go the long way and post the map URL on Facebook, without the option of updating it collaboratively?
Google's position is much stronger than many people believe, once you think about the integration possibilities. The current Plus is just a fragment, the missing puzzle piece connecting the other apps. But imagine that Google now Plus-connects its various services: YouTube, Maps, Mail, Talk, ... - for them, this is just a matter of some engineering. I guess probably half of these are already in internal testing. And Facebook just can't keep up there. Sure, they do have Facebook Video, but actually, most people prefer YouTube. And while Google can integrate Plus all the way there, Facebook cannot. And while Google is the master of search, Facebook is particularly weak there - they can't even properly search their own data. Google, however, will at some point offer a "find things that interest me" button; Sparks, where you have to manually define your interests (which will probably remain an option due to privacy concerns!), is just the beginning; it is way too static right now.
So essentially, Google doesn't need to copy Facebook. They just need to do what is obvious with their own products, and Facebook will have a hard time keeping up.
Plus, in my opinion, Google got the timing just right. Users aren't too happy with Facebook these days, but there is just no big alternative around anymore; their friends are on Facebook and not on some other social network. Facebook doesn't seem to evolve much anymore. The mail functionality opened more security holes (apparently you can post to a group under a fake user name by spoofing the sender's email address) than it added functionality and usefulness. Privacy is still not in line with the laws of countries such as Germany, but Facebook keeps telling those users essentially that they don't care. Spam and fraud still reappear every month, following the same clickjacking patterns again and again. The search function of Facebook is still usually described as "useless". People waste time in games and annoy their friends with random game invitations and posts. Facebook had better make a major move now, too - more than a demo of Skype integration. But whenever Facebook changed, their users complained ...

Of course, Google+ still has a long way to go, too. There are still many things missing, for example groups and events. I figure Google is already testing them in "dogfood", and they'll actually come out within the month. With groups I do not mean the existing Google product, but what would be "public circles" that you need to join and that are accessible to all circle members instead of just the creator. And events are also a key function - probably one of the most used on Facebook. These may require much more careful design to integrate well with Calendar. But given the visual update of Calendar these days, this may just be around the corner, too.

4 June 2011

Mike (stew) O'Connor: My Movein Script

Erich Schubert's blog post reminded me that I've been meaning to write up a post detailing how I'm keeping parts of my $HOME in git repositories. My goal has been to keep my home directory in a version control system effectively. I have a number of constraints, however. I want the system to be modular: I don't always need X-related config files in my home directory; sometimes I want just my zsh-related files and my emacs-related files. I have multiple machines I check email from, and on those I want to keep my notmuch/offlineimap files in sync, but I don't need these on every machine I'm on, especially since those configurations have more sensitive data. I played around with laysvn for a while, but it never really seemed comfortable. I more recently discovered that madduck had started a "vcs-home" website and mailing list, talking about doing what I'm trying to do. I'm now going with madduck's idea of using git with detached work trees, so that I can have multiple git repositories all using $HOME as their $GIT_WORK_TREE. I have a script inspired by his vcsh script that will create a subshell where the GIT_DIR and GIT_WORK_TREE variables are set for me. I can do my git operations related to just one git repository in that shell, while still operating directly on my config files in $HOME, and avoiding any kind of nasty symlinking or hardlinking. Since I am usually using my script to allow me to quickly "move in" to a new host, I named it "movein". It can be found here. Here's how I'll typically use it:
    stew@guppy:~$ movein init
    git server hostname? git.vireo.org
    path to remote repositories? [~/git] 
    Local repository directory? [~/.movein] 
    Location of .mrconfig file? [~/.mrconfig] 
    stew@guppy:~$ 
This is just run once. It asks me questions about how to set up the 'movein' environment. Now I have a .moveinrc storing the answers I gave above, a stub of a .mrconfig, and an empty .movein directory. The next thing to do is to add some of my repositories. The one I typically add on all machines is my "shell" repository. It has a .bashrc and a .zshrc, an .alias file that both of them source, and other zsh goodies I'll generally wish to be around:
    stew@guppy:~$ ls .zshrc
    ls: cannot access .zshrc: No such file or directory
    stew@guppy:~$ movein add shell
    Initialized empty Git repository in /home/stew/.movein/shell.git/
    remote: Counting objects: 42, done.
    remote: Compressing objects: 100% (39/39), done.
    remote: Total 42 (delta 18), reused 0 (delta 0)
    Unpacking objects: 100% (42/42), done.
    From ssh://git.vireo.org//home/stew/git/shell
     * [new branch]      master     -> origin/master
    stew@guppy:~$ ls .zshrc
    .zshrc
So what happened here is that the ssh://git.vireo.org/~/git/shell.git repository was cloned with GIT_WORK_TREE=~ and GIT_DIR=.movein/shell.git. My .zshrc (along with a bunch of other files) has appeared. Next perhaps I'll add my emacs config files:
    stew@guppy:~$ movein add emacs       
    Initialized empty Git repository in /home/stew/.movein/emacs.git/
    remote: Counting objects: 77, done.
    remote: Compressing objects: 100% (63/63), done.
    remote: Total 77 (delta 10), reused 0 (delta 0)
    Unpacking objects: 100% (77/77), done.
    From ssh://git.vireo.org//home/stew/git/emacs
     * [new branch]      emacs21    -> origin/emacs21
     * [new branch]      master     -> origin/master
    stew@guppy:~$ ls .emacs
    .emacs
    stew@guppy:~$ 
My remote repository has a master branch, but also an emacs21 branch, which I can use when checking out on older machines that don't yet have newer versions of emacs. Let's say I have made changes to my .zshrc file and I want to check them in. Since we are working with detached work trees, git can't immediately help us:
    stew@guppy:~$ git status
    fatal: Not a git repository (or any of the parent directories): .git
The movein script allows me to "login" to one of the repositories. It will create a subshell with GIT_WORK_TREE and GIT_DIR set. In that subshell, git operations operate as one might expect:
    stew@guppy:~ $ movein login shell
    stew@guppy:~ (shell:master>*) $ echo >> .zshrc
    stew@guppy:~ (shell:master>*) $ git add .zshrc                                       
    stew@guppy:~ (shell:master>*) $ git commit -m "adding a newline to the end of .zshrc"
    [master 81b7311] adding a newline to the end of .zshrc
     1 files changed, 1 insertions(+), 0 deletions(-)
    stew@guppy:~ (shell:master>*) $ git push
    Counting objects: 8, done.
    Delta compression using up to 2 threads.
    Compressing objects: 100% (6/6), done.
    Writing objects: 100% (6/6), 546 bytes, done.
    Total 6 (delta 4), reused 0 (delta 0)
    To ssh://git.vireo.org//home/stew/git/shell.git
       d24bf2d..81b7311  master -> master
    stew@guppy:~ (shell:master*) $ exit
    stew@guppy:~ $ 
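For illustration, here is a minimal Python sketch of roughly what such a "login" subshell amounts to. This is not the actual movein script (which is shell); it just shows the GIT_DIR/GIT_WORK_TREE idea, assuming the ~/.movein layout configured above:
    import os
    import subprocess

    def movein_login(name, movein_dir=os.path.expanduser("~/.movein")):
        # Point git at the fake-bare repository while using $HOME as the work tree,
        # then spawn an interactive subshell with these variables exported.
        env = dict(os.environ)
        env["GIT_DIR"] = os.path.join(movein_dir, name + ".git")
        env["GIT_WORK_TREE"] = os.path.expanduser("~")
        subprocess.call([env.get("SHELL", "/bin/sh")], env=env)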
If I want to create a brand new repository from files in my home directory, I can:
    stew@guppy:~ $ touch methere
    stew@guppy:~ $ touch mealsothere
    stew@guppy:~ $ movein new oohlala methere mealsothere
    Initialized empty Git repository in /home/stew/git/oohlala.git/
    Initialized empty Git repository in /home/stew/.movein/oohlala.git/
    [master (root-commit) 7abe5ba] initial checkin
     0 files changed, 0 insertions(+), 0 deletions(-)
     create mode 100644 mealsothere
     create mode 100644 methere
    Counting objects: 3, done.
    Delta compression using up to 2 threads.
    Compressing objects: 100% (2/2), done.
    Writing objects: 100% (3/3), 224 bytes, done.
    Total 3 (delta 0), reused 0 (delta 0)
    To ssh://git.vireo.org//home/stew/git/oohlala.git
     * [new branch]      master -> master
Above, the command movein new oohlala methere mealsothere says "create a new repository containing two files: methere and mealsothere". A bare repository is created on the remote machine, a repository is created in the .movein directory, the files are committed, and the new commit is pushed to the remote repository. Now, on some other machine, I could run movein add oohlala to get these two new files. The movein script maintains a .mrconfig file, so that joeyh's mr tool can be used to manage the repositories in bulk. Commands like "mr update", "mr commit", and "mr push" will act on all the known repositories. Here's an example:
    stew@guppy:~ $ cat .mrconfig
    [DEFAULT]
    include = cat /usr/share/mr/git-fake-bare
    [/home/stew/.movein/emacs.git]
    checkout = git_fake_bare_checkout 'ssh://git.vireo.org//home/stew/git/emacs.git' 'emacs.git' '../../'
    [/home/stew/.movein/shell.git]
    checkout = git_fake_bare_checkout 'ssh://git.vireo.org//home/stew/git/shell.git' 'shell.git' '../../'
    [/home/stew/.movein/oohlala.git]
    checkout = git_fake_bare_checkout 'ssh://git.vireo.org//home/stew/git/oohlala.git' 'oohlala.git' '../../'
    stew@guppy:~ $ mr update
    mr update: /home/stew//home/stew/.movein/emacs.git
    From ssh://git.vireo.org//home/stew/git/emacs
     * branch            master     -> FETCH_HEAD
    Already up-to-date.
    mr update: /home/stew//home/stew/.movein/oohlala.git
    From ssh://git.vireo.org//home/stew/git/oohlala
     * branch            master     -> FETCH_HEAD
    Already up-to-date.
    mr update: /home/stew//home/stew/.movein/shell.git
    From ssh://git.vireo.org//home/stew/git/shell
     * branch            master     -> FETCH_HEAD
    Already up-to-date.
    mr update: finished (3 ok)
There are still issues I'd like to address. The big one in my mind is that there is no .gitignore, so when you "movein login somerepository" and then run "git status", it tells you about hundreds of untracked files in your home directory. Ideally, I just want to know about the files that are already associated with the repository I'm logged into.

2 June 2011

Erich Schubert: Managing user configuration files

Dear Lazyweb,
How do you manage your user configuration files? I have around four home directories I frequently use. They are reasonably well in sync, but I have been considering actually using some file management to synchronize them better. I'm talking about files such as the shell config, ssh config, .vimrc, etc.
I had some discussions about this before, and the consensus had been that some version control system probably is best. Git seemed to be a good candidate; I remember having read about things like this a dozen years ago when CVS was still common and Subversion was new.
So dear lazyweb, what are your experiences with managing your user configuration? What setup would you recommend?
Update: See vcs-home for various related links and at least five different ways of doing this. mr, a multi-repository VCS wrapper, seems particularly well suited for this.

27 May 2011

Erich Schubert: Dear Lazyweb, how to write multi-locale python code

Dear Lazyweb,
I've been toying around with a Python WSGI application, i.e. a multi-threaded persistent web application. Now I'd like to add multi-language support to it. I need to format datetimes in human-readable formats, but I haven't yet found a sane way to do this using strftime. Essentially, strftime will use the current application locale; however, since I'm running multi-threaded, different threads might want to use different locales, so changing the locale is bound to cause race conditions.
So what is the best way to pretty-print (including weekday names!) datetimes, currencies and similar values in a multi-threaded, multi-locale context in Python? Gettext and manually emulating strftime doesn't sound that sensible to me. And of course, I don't want to have to translate the weekday names myself into every language I choose to support...
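For illustration, here is a minimal sketch of the kind of API I am looking for, using the Babel library (packaged in Debian as python-babel, if I am not mistaken) as one candidate; whether it covers everything I need is exactly my open question. Babel takes the locale as an explicit argument instead of relying on the process-wide C locale, so it should be safe to use from concurrently running threads:
from datetime import datetime
from babel.dates import format_datetime
from babel.numbers import format_currency

dt = datetime(2011, 5, 27, 18, 30)
# "full" includes the localized weekday name, e.g. "Freitag" for de_DE
print(format_datetime(dt, format="full", locale="de_DE"))
print(format_datetime(dt, format="full", locale="en_US"))
print(format_currency(19.99, "EUR", locale="de_DE"))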

12 May 2011

Erich Schubert: AMD64 broken on Debian unstable - avoid libc6 2.13-3

Beware of upgrading on AMD64. Make sure to avoid libc6 version 2.13-3, as it will render your system unbootable and unusable. The reason is as simple (a missing symlink) as the effect is severe.
The bug report has instructions on how to recover. If you are lucky, you have a root shell open to restore the missing link. Otherwise, you need to reboot with the parameters break=init rw, recover the link with cd root; ln -s lib lib64, then sync, unmount, and reboot. It's not really hard to do once you know how, but it is a lot easier to avoid upgrading to this version. My i386 mirror already has the fixed upload (but i386 is not affected anyway). So by tomorrow, it should be safe again (depending on your mirror's delay).

5 May 2011

Erich Schubert: Upcoming publications in data mining

Upcoming 2011 publications of my research:
Just presented at the SDM11 last weekend:
H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
Interpreting and Unifying Outlier Scores
In Proceedings of the 11th SIAM International Conference on Data Mining (SDM), Mesa, AZ, 2011.
To be presented and published end of August:
T. Bernecker, M. E. Houle, H.-P. Kriegel, P. Kröger, M. Renz, E. Schubert, A. Zimek
Quality of Similarity Rankings in Time Series
In Proceedings of the 12th International Symposium on Spatial and Temporal Databases (SSTD), Minneapolis, MN, 2011.
E. Achtert, A. Hettab, H.-P. Kriegel, E. Schubert, A. Zimek
Spatial Outlier Detection: Data, Algorithms, Visualizations
In Proceedings of the 12th International Symposium on Spatial and Temporal Databases (SSTD), Minneapolis, MN, 2011.
The latter will also accompany the release of version 0.4 of our data mining research software ELKI.

28 April 2011

Erich Schubert: SIAM SDM11 - Unified Outlier Scores

I'm currently in Phoenix, AZ at the 2011 SIAM International Conference on Data Mining.
My contribution is titled "Interpreting and Unifying Outlier Scores", a method that allows the combination, interpretation, and visualization of the scores of existing outlier detection algorithms. The method brings a bit more statistics back into a data mining area that has drifted away from its statistical roots.
We apply the method to a couple of outlier detection algorithms and combine them using a naive ensemble approach that still outperforms existing outlier ensembles.
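As a toy illustration only (this is not the method from the paper): once each detector's scores have been rescaled to a comparable range, a naive ensemble can be as simple as averaging the rescaled scores per object.
import numpy as np

def minmax_rescale(scores):
    # Crude [0, 1] rescaling; a stand-in for the statistically grounded
    # unification proposed in the paper.
    s = np.asarray(scores, dtype=float)
    lo, hi = s.min(), s.max()
    return (s - lo) / (hi - lo) if hi > lo else np.zeros_like(s)

def naive_ensemble(score_lists):
    # One list of raw outlier scores per detector, all in the same object order.
    return np.mean([minmax_rescale(s) for s in score_lists], axis=0)

print(naive_ensemble([[0.1, 5.0, 0.2], [10.0, 80.0, 12.0]]))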

14 March 2011

Erich Schubert: GNOME3 in Debian experimental - python and dconf

As GNOME3 slowly enters Debian experimental, things become a bit ... experimental.
The file manager can be set to still draw icons on the desktop, but that doesn't entirely work yet (it will then also open folders as desktop...).
One machine had lost its keyboard settings. I could not set the fonts I wanted...
There is a tool called dconf-editor that allows you to manually tweak some settings such as the fonts. But it doesn't seem to support value lists yet - and the keyboard mappings setting is a string list.
So here's sample Python code to modify such a value:
from gi.repository import Gio
# Open the GSettings schema for the keyboard configuration
s = Gio.Settings.new("org.gnome.libgnomekbd.keyboard")
# "layouts" is a list of strings; set it to the German layout only
s.set_strv("layouts", ["de"])
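To check that the change took effect, you can read the value back with the same Gio.Settings API:
from gi.repository import Gio
s = Gio.Settings.new("org.gnome.libgnomekbd.keyboard")
print(s.get_strv("layouts"))  # should now print ['de']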
Update: you could also install the optional libglib2.0-bin package and use the gsettings command-line tool.

Erich Schubert: What is really happening at Fukushima?

As far as I can tell, neither the Japanese government nor the operating companies are telling the truth.
Here's my take on the story:
  • The tsunami not only destroyed the generators, but also the complete cooling systems.
  • Thus, the cores will overheat and melt; they have no way to prevent this, they can only try to keep the damage as low as possible.
  • The cooling water turned into gas and dissociated into hydrogen and oxygen. Unfortunately, this mixture is highly explosive. So the best they can do is to get the gas out of the core containment and let it explode outside, where it does not cause too much radioactive pollution (a core explosion would be really bad). This has already happened at reactors 1 and 3, and will happen at reactor 2 within the next one or two days.
  • They use sea water to slow down the core meltdown and keep the core containment stable. They probably do this by flooding the secondary containment, where there are no direct radioactive materials.
So essentially, I don't see much risk of a nuclear explosion, and I expect the radioactive pollution to be quite low, mostly via indirect radiation from the coolant in the outer containment. The reactor, however, is trashed and full of highly radioactive waste that will require constant cooling for the next few years.
However, I'm really concerned that the government and the companies involved apparently lied to us. This happens a lot when it comes to nuclear power; they are barely truthful about what is happening.
Nuclear power is only safe as long as it is operated by altruistic and responsible people. Once you bring the free market, politics and money in, it becomes a dangerous toy that should not be toyed with by humans.

Erich Schubert: Google Circles rumor

The media are spreading a rumor about a social networking platform by Google, called Circles.
So far, these things have been debunked.
I do not believe that Google could succeed with a "Facebook clone". The market is already taken; Facebook is too big and has successfully been taking international markets away from local clones. For example, in Germany the other social networks used to be a lot bigger, but Facebook overtook them in just a few months, and the people I know pretty much quit the other networks then.
And I assume Google is aware that unless they have a strong strategy, the network would go the Wave-Buzz route.
At least here in Germany, people are surprisingly concerned about the amount of data Google might have on them, while at the same time they give it to Facebook for free. This is probably due to the media attention around StreetView. It's not fair, but that's life.
However, "social network" is a broad term. Just think of what people actually use facebook for:
  • Games (often even not with their social circle!)
  • Microblogging
  • Photo sharing
  • Email
  • Automatic address book
Now, if you look closely at this list, Google pretty much has all of these. If we ignore the browser games (and they are all over the internet by now, with and without Facebook!), then there is just one key ingredient missing for Google: the address book thing.
I believe this "Google Circles" thing will largely be a "smart address book" that helps people manage their social contacts in a social network style. And obviously, this can be integrated into various products such as Mail, Picasa, Buzz, ...
If Google manages to launch a "contact manager" that makes it really easy for people to manage their social circles, this can be quite a killer. Facebook has "lists", but they are awful to use. It likes to "hide" friends to reduce the amount of information it throws at you, but it doesn't really organize it for you, for example by social circles or topics. These days, for example, Facebook could split the news feed into "Japan" and "everything else", I guess.

22 February 2011

Erich Schubert: Taking Google Calendar to the limits

The Global Lindy Hop Map I've built as a toy project is actually a calendar with geo-annotated events. It is currently backed by a custom database that uses Xapian for the search functionality to improve performance.
The data comes from around 150 Google calendars from various dancing communities. I'm preprocessing the data to have reliable geo information as well as doing some filtering and HTML formatting.
Instead of putting everything into my own database - which doesn't know about recurrence rules, but relies on materialized recurrences - I've also tried to sync all 150 calendars into one huge "master" calendar.
However, using the Google GData APIs to access this calendar takes way too long for the website to be usable. This is not too surprising: there are around 3500 instances in the calendar, and some of this requires the computation of recurrence rules. And I cannot do the synchronization without a local ID mapping cache anyway (though I can recover a lost cache from additional information I put into the calendar). There are around 60 event instances per day, since most come from weekly repetitions.
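For illustration, materializing a weekly recurrence rule into concrete event instances - which is what my database stores instead of the rules themselves - could look roughly like this (a sketch using python-dateutil, not my actual code):
from datetime import datetime
from dateutil.rrule import rrule, WEEKLY

# A weekly dance night, expanded into individual occurrences for a
# three-month window instead of storing the recurrence rule itself.
instances = list(rrule(WEEKLY,
                       dtstart=datetime(2011, 2, 1, 20, 0),
                       until=datetime(2011, 5, 1)))
print(len(instances))   # number of materialized occurrences
print(instances[0])     # 2011-02-01 20:00:00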
The HTML embed rendering can take quite a while to load (although the results apparently are cached somewhere) - looking at December either hits some processing limit or times out after around 40 seconds.
Looks like I'm going beyond the scope Google Calendar was designed for. :-)

12 February 2011

Erich Schubert: Joerg Schilling still spreading FUD

Jörg Schilling is still spreading FUD (currently on an openSuSE list):
There is a social issue with Debian that attacks OSS projects _because_ they use the GPL. Please do not follow these attacks without asking a lawyer.
The removal of cdrecord has been the best reaction to these issues, since working with him is apparently impossible. In this thread, he again manages to accuse everybody else of lying and being incompetent. This clearly shows that you cannot work with him, and the only viable way is to ignore him as far as possible.
It also speaks of a hurt ego in a probably narcissistic person (after all, he seems to think everybody else is incompetent and lying - but I am not a psychologist, so this is not a diagnosis!).
Let me just point out one fact:
By not shipping his current cdrecord, we obviously cannot violate any license, even a potentially invalid one. The way Debian is dealing with this issue is legally undoubtedly correct: obviously, we are not obliged to include his software.
In the long run, I assume that Debian will also get rid of cdrkit/wodim. In my opinion, it is just legacy code until we can switch to libburnia entirely, and we should now try to drop cdrkit/wodim as well in order to ship "wheezy" without it.
Since we cannot work with Jörg, the only sane way is to try to completely remove any code he wrote from our systems, so we have no reason whatsoever to communicate with him.
cdrecord is dead, long live libburnia.
I expect to see lots of FUD from him in the post comments any time soon ... but I do not care: I don't use his software. My current laptop doesn't even have a CD drive ... and despite his claims, the Debian versions of cdrecord and wodim worked perfectly for me back when I was still burning CDs.
My only advice to you is to ignore him as far as possible. Engaging with him will not get you anywhere; it is just a waste of time. In fact, I shouldn't have wasted the time it took me to write this blog post.

6 February 2011

Erich Schubert: Debian Squeeze released!

Debian GNU/Linux 6.0 "squeeze" has been released today.
Congratulations to everyone involved in ironing the last few bugs out.
(My own involvement had been a bit limited recently, but I at least kept my few remaining packages in a ready-to-release state and helped with the occasional bug report and patch.)
Some people think that Ubuntu is the better Debian - I do NOT. Debian is a fun place, has great people working on it, and is true to its aim of creating a truly free and high-quality distribution. The long release cycles of Debian are a feature, not a bug. Stable is for production systems, not toy projects.
The proper way of characterizing Debian stable is conservative and sustainable, but not outdated. It actually can do everything you need - and will still do so in 10 years.
If you have been using open source for as long as I have (say 15+ years), you will probably have seen software hypes come and go. The one thing that has always stayed the same is Debian: dead reliable. There was a time when everyone was crazy about Enlightenment for its shiny, pretty UI - almost like Compiz/Beryl just two years ago. It came (as a matter of fact, Debian also had it), but it also went.
I also remember how people complained that Debian didn't ship Xgl back in 2006 when it was the latest hype (there was no Debian release in 2006). Well, Xgl died in 2008. The features remained, but implemented in a much nicer way, and they also found their way into Debian. In fact, I did run Xgl at least once, on Debian, just not on "stable". One could say that Xgl never was quite "stable", was it?
Debian stable is good the way it is: an administrator's choice. Of course, developers might have different needs, but there are also testing, unstable and experimental. Just make sure to align your choice with your needs. And sometimes, also rethink your needs: you cannot have the "latest beta versions" and a "stable platform" at the same time.

28 January 2011

Erich Schubert: Google Research Awards

Any Googlers reading this maybe? Anyone from MapReduce or Image Labeler would be best.
I'm looking for someone to sponsor a "research award" application for me. The applications are written and ready to hand in, but I've read that applications have much better chances when they have a "sponsor" at Google - someone familiar with the topic who considers them interesting.
One is about the behaviour of large high-dimensional data sets, and we just don't have access to appropriate data. A data excerpt from Google Image Labeler would help us a lot. (No budget, just some data.)
The second is about adapting some outlier detection methods - which usually run in quadratic time and thus are not usable in many situations - to scale in linear time on a MapReduce cluster. The first results (I already have some students working on this for me) look very promising, but I'd like a PhD student like me to continue this work, since some aspects go beyond what is doable in regular 6-month theses.
If you are interested, please contact me at erich AT debian org with your Google address to discuss the ideas. From earlier experience, I have the impression that without an interested Google contact, my chances of getting accepted are rather low.

17 December 2010

Erich Schubert: Is ORACLE trying to break up with Open Source?

It seems like ORACLE is trying to break up with all developer communities that supported and promoted their products for years. Here are some recent developments I've read about in the media:
  • OpenSolaris community breakup (forked to IllumOS)
  • Solaris losing e.g. Bryan Cantrill (Dtrace author), Mike Shapiro (ZFS), Jeff Bonwick (ZFS, slab allocator, LZJB)
  • Apache Foundation leaves the Java Community Process EC
  • Java also losing James Gosling (the father of the Java language), Simon Phipps and Chris Melissinos
  • OpenOffice (forking to LibreOffice, and many key contributors being forced to leave OpenOffice by ORACLE)
  • MySQL losing Brian Aker
  • ... and many more
So what will they break next?
Will the next free JVM still have all the features, or will we have to abandon Java unless we pay for their premium JVM?
